The statistical methods used in deriving physics results in the BaBar collaboration are reviewed, with especial
emphasis on areas where practice is not uniform in particle physics.

1. Introduction

The purpose of the BaBar experiment at the PEP-II
accelerator at SLAC is to study $e^+e^-$ collisions in
the 10 GeV center-of-mass region, namely the region
around $B\bar{B}$ threshold. In particular the program is
to investigate extensively CP violation and rare de- 
cays of B mesons, as well as topics in charm and tau 
physics. 

Here, BaBar's approach to statistical issues is sum- 
marized. Emphasis is given to areas which are often 
controversial. 



2. BaBar Analysis Organization



BaBar is a collaboration of approximately 600 
physicists, from ~ 80 institutions in a dozen coun- 
tries. Thus, managing the production of physics re- 
sults, from initial analysis to final publication, while 
maintaining collaboration involvement is a daunting 
task. An organizational structure has been established
to facilitate this process, as illustrated in Fig. 1.

The "Statistics Working Group" was appointed by 
the Publications Board in order to provide guidelines 
and advice on statistical matters [1]. This group is
advisory; I'll note how well the guidelines are actually 
adopted in some cases. 



3. Philosophy



The approach to choosing a statistical procedure is 
to start by considering the goal. We adopt the view 
that there are two broad domains in terms of goal: 

• The first goal is that of summarizing the rel- 
evant information in a measurement. This is 
"descriptive" statistics. It is considered obliga- 
tory to report such a description of the result 
of the experiment. Inherent in this is the view 
that it is actually useful to do so, a notion that 
is not uniformly accepted. The use of frequency 
statistics is recommended for this purpose. The
choice within the domain of possible frequency
statistics is driven by an emphasis on clarity and
the facility to compare and combine with other
measurements.

• The second goal is that of interpreting the
relevant information in the context of making a
statement about "physics". This is regarded as
optional, since once the relevant information is
available people are in principle able to do this
step for themselves. Because a statement about
physical reality may depend on other information,
and on theoretical input, Bayesian statistics
are recommended.

Figure 1: BaBar analysis organization. A detailed
analysis for some physics result is typically performed by
a subset of the collaboration, labelled "authors" here.
There are several layers of review that occur as an
analysis moves towards publication: an Analysis Working
Group interacts with the authors from the earliest stages;
once a document is produced, a Review Committee of
typically three people is assigned by the Publications
Board to critically examine the analysis; upon approval
from the Review Committee, the paper is circulated for
collaboration-wide review, including several institutions
specially designated to look closely at it. Oversight of the
process and final review is carried out by the
Publications Board.

It may be remarked that there may be other goals, 
such as making a decision concerning how to spend 
money for the next experiment. This would involve, 
beyond the above interpretive aspects, a consideration 
of the risks and benefits. We take the point of view 
that this is outside the scope of the analysis and re- 
porting of results, and hence do not discuss it further. 



4. Statistical Practice in BaBar 

We turn now to a review of the specific statistical 
practices recommended or adopted in BaBar analy- 
ses. Not included are the methods and tools used 
for optimizing analyses, and pattern recognition, data 
reduction, and simulation procedures. These matters 
are crucial, but here we emphasize instead areas which 
are traditionally more controversial. It should be men- 
tioned that the typical products of a BaBar physics 
analysis are: 

1. "Best" estimates for physical parameters. 

2. Interval estimates for physical parameters. 

3. Significance levels of observations (e.g., of a pos- 
sible discovery). 

4. Goodness-of-fit of models to the data. 



4.1. Blind Analysis 

Many BaBar results are obtained in "blind analyses".
The purpose of a blind analysis is to avoid the
introduction of bias, which could occur if the analyst
is looking at the results as the analysis is designed.
There is more than one approach to "blindness"; see
the talk by Aaron Roodman for a summary of
BaBar practice. We'll give one example here.

Consider the measurement of the rare
B decay $B \to K^{(*)} \ell^+ \ell^-$, of interest because of
its sensitivity to possible physics beyond the standard
model. The basic idea of the analysis is to look for a
signal which peaks in the distribution of two kinematic
variables, known as "$\Delta E$" and "$m_{ES}$" (Fig. 2). A fit
is performed to this two-dimensional distribution in 
order to extract the strength of any signal present. 
However, before performing the fit, an event selection 
is made in order to suppress backgrounds. In order to 
avoid biasing the result by looking at the data while 
tuning the selection, a blind analysis is performed.

Figure 2: Example of a blind analysis in BaBar. The
upper plot shows a Monte Carlo simulation of the signal
process. The outside boundaries delimit
the "large sideband region"; the intermediate box is the
"fit region", and the inner box is the region in which the
signal is concentrated (referred to as the "signal region",
but in fact playing no special role in the analysis). The
lower plot shows the BaBar data after unblinding. Here,
the outside boundaries demarcate the fit region, and the
smaller box is the "signal region".



The $\Delta E$-$m_{ES}$ plane is divided into two regions:
a region where the fit will be performed, which in- 
cludes the region where a signal might appear; and 
a larger ("large sideband") region which excludes the
fit region. During the tuning of the analysis, the data 
may not be looked at in the fit region, only in the 
large sideband region. Monte Carlo and control sam- 
ple data (including a type of data resembling signal) 
are used to tune the analysis. Once the selection crite- 
ria have been established, the fit region of the data is 
revealed, and the fit performed to extract the result. 
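
The region logic described above can be sketched in a few lines. This is only an illustration of the idea, not BaBar code: the box boundaries, the units, and the names `FIT_REGION`, `SIDEBAND` and `visible_while_blind` are all hypothetical.

```python
# Hypothetical region boundaries as ((dE_lo, dE_hi), (mES_lo, mES_hi)).
# The numerical values are placeholders chosen for illustration only.
FIT_REGION = ((-0.25, 0.25), (5.20, 5.29))
SIDEBAND = ((-0.50, 0.50), (5.00, 5.29))   # outer box; the fit region is cut out of it


def in_box(de, mes, box):
    """True if the point (de, mes) lies inside the rectangular box."""
    (de_lo, de_hi), (mes_lo, mes_hi) = box
    return de_lo <= de <= de_hi and mes_lo <= mes <= mes_hi


def visible_while_blind(events):
    """Events that may be inspected during tuning: inside the outer box
    but outside the (still hidden) fit region."""
    return [
        (de, mes)
        for de, mes in events
        if in_box(de, mes, SIDEBAND) and not in_box(de, mes, FIT_REGION)
    ]
```

Unblinding then amounts to dropping the `not in_box(..., FIT_REGION)` condition once the selection criteria are frozen.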

As BaBar is continuing to accumulate data, an issue 
arises when it is desired to update a blind analysis to 
include new data. In principle, one could simply add 
the new data, without changing the analysis. How- 
ever, this may be impractical, or undesirable. For 
example, the entire dataset may be re-reconstructed 
with improved constants or pattern recognition code. 
Or, there may have been improvements in tools such 
as particle identification. One would like to incorporate
the benefit from such improvements. Additionally,
it might be desirable to work harder to optimize
the analysis, or to optimize on different criteria, such
as precision instead of sensitivity. BaBar often takes
a practical compromise approach to incorporating new
data and such improvements. We have the notion of
"re-blinding" the data, and re-optimizing. It is considered
safe in this re-optimization to use variables
which have not been inspected too carefully in the
blind region in the first dataset. Nonetheless, once we
have done this, we do not refer to the new result as
having been done with a blind analysis.

BaBar is perhaps the first large HEP collaboration 
to have embraced the blind methodology so enthusias- 
tically. However, not every BaBar analysis is blind. In 
particular, analyses which may be called exploratory 
are generally not blinded. A recent example from
BaBar is the discovery of the $D_{sJ}^{*}(2317)^{+}$, which
was not the result of a blind analysis. There are many 
examples of people being led astray by such non-blind 
exploratory analyses, so extreme caution is warranted. 
The exploratory nature of such analyses makes it diffi- 
cult to apply rigorous methodologies with well-defined 
statistical properties. It may not be impossible to do
better though.



4.2. Confidence Intervals 

The recommendation in BaBar is to use frequency 
statistics for summarizing information (Sect. 3). The
goal is to describe what is observed, stressing simplic- 
ity and coherence of interpretation, as well as facility 
in combining with other results. With these crite- 
ria, we think it can be counter-productive to impose 
"physical" constraints. There is no reason to obscure 
the observation of an "unlikely" result. Imposing con- 
straints may also complicate combination of results. 
Generally, the recommendation is to quote two-sided 
68% confidence intervals as the primary result. Where 
there may be doubt, a check for frequency validity 
(coverage) should be performed. 

4.2.1. Example in Two Dimensions 

As an example of the construction of a confidence 
region in a BaBar analysis, consider the measurement 
of D mixing and doubly Cabibbo suppressed D decays.
In this analysis, two parameters of interest
are to be determined, which may be expressed as $x'$
and $y'$ according to the relations:

$$x' = \frac{\Delta m}{\Gamma}\cos\delta + \frac{\Delta\Gamma}{2\Gamma}\sin\delta, \tag{1}$$

$$y' = \frac{\Delta\Gamma}{2\Gamma}\cos\delta - \frac{\Delta m}{\Gamma}\sin\delta, \tag{2}$$

where $m$ and $\Gamma$ are the D mass and width. $\Delta m$ and
$\Delta\Gamma$ are the (small) differences in masses and widths
between the two D mass eigenstates, and $\delta$ is an
unknown strong phase (between Cabibbo-favored and
doubly Cabibbo suppressed amplitudes). The measurement
is only sensitive to $x'^2$ and $y'$, and it is possible
that the maximum of the likelihood will occur
at $x'^2 < 0$ ("unphysical" region). At the current level
of sensitivity, we should find a result consistent with
$x'^2 = y' = 0$, if the standard model is correct.

The construction of a confidence region in the
two-dimensional $(x'^2, y')$ plane, corresponding to 95%
confidence level with the frequency interpretation, is
performed as follows (Fig. 3):

1. Pick a point $(x_0'^2, y_0')$ in the plane.

2. Form the "data" likelihood ratio comparing the
observed maximum likelihood with the likelihood
at $(x_0'^2, y_0')$:

$$\Lambda_{\rm Data} = \frac{\mathcal{L}_{\max}({\rm Data})}{\mathcal{L}(x_0'^2, y_0'; {\rm Data})}. \tag{3}$$

3. Simulate many experiments with $(x_0'^2, y_0')$ taken
as the true values of the parameters.

4. For each Monte Carlo simulation form the "MC"
likelihood ratio:

$$\Lambda_{\rm MC} = \frac{\mathcal{L}_{\max}({\rm MC})}{\mathcal{L}(x_0'^2, y_0'; {\rm MC})}. \tag{4}$$

5. From the ensemble of simulations, determine the
probability $P(\Lambda_{\rm MC} > \Lambda_{\rm Data})$. If this probability
is greater than 0.95, then the point $(x_0'^2, y_0')$ is
inside the contour; if less than 0.95, then the
point is outside the contour.

6. This procedure is repeated for many choices of
$(x_0'^2, y_0')$ in order to map out the contour.
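
The steps above can be sketched with toy experiments. The following is a minimal illustration, not the BaBar implementation: it assumes a hypothetical two-parameter model in which the measurement is a unit-variance Gaussian around the true point, so that $\ln(\mathcal{L}_{\max}/\mathcal{L}(\theta_0))$ is simply half the squared distance between the measurement and $\theta_0$. The acceptance test is written in tail-probability form: a point is kept when the fraction of toys at least as extreme as the data is no smaller than 1 - CL. How the comparison is oriented depends on how the ratio is defined. All function names are illustrative.

```python
import random


def q_stat(x_obs, theta0):
    """ln(L_max / L(theta0)) for the assumed unit-variance Gaussian toy model.

    In this model the likelihood is maximal at theta = x_obs, so the log ratio
    reduces to half the squared distance between x_obs and theta0.
    """
    return 0.5 * sum((xo - t) ** 2 for xo, t in zip(x_obs, theta0))


def tail_fraction(x_data, theta0, n_toys=2000, rng=None):
    """Fraction of simulated experiments, generated with theta0 as the true
    value, whose statistic is at least as large as the one seen in data."""
    rng = rng or random.Random(1)
    q_data = q_stat(x_data, theta0)
    n_tail = 0
    for _ in range(n_toys):
        toy = [rng.gauss(t, 1.0) for t in theta0]   # one simulated experiment
        if q_stat(toy, theta0) >= q_data:
            n_tail += 1
    return n_tail / n_toys


def inside_region(x_data, theta0, cl=0.95, **kwargs):
    """True if theta0 belongs to the confidence region at level cl."""
    return tail_fraction(x_data, theta0, **kwargs) >= 1.0 - cl
```

Mapping the contour then means calling `inside_region` on a grid of candidate points and tracing the boundary between accepted and rejected points. In a real analysis each toy requires a full fit to obtain $\mathcal{L}_{\max}$, which is where the computing cost lies.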

Fig. 3 shows the result of this algorithm. The choice
was made to stop computing the contour at the border
of the "physical" region. The computation could
in principle have been carried into the "unphysical" 
region (up to technical difficulties of the sort we shall 
discuss anon). It of course makes no difference to the 
frequency interpretation whether it is extended into 
the "unphysical" region or not. 

4.2.2. Low Statistics Issues 

Issues arise in applying the recommendation of al- 
ways quoting a two-sided interval for a parameter 
when the sampling is not from an approximate nor- 
mal distribution. Most often this involves the low- 
statistics regime of a counting process. 

The first issue is a technical one: it can happen that 
a search in parameter space wants to go into a region 
where the probability distribution is undefined. This 
is distinct from going into an "unphysical" region as 
in the example above: we'll call it crossing a "math 
boundary". As a simple example, consider the case of
a normal "signal" on a flat "background", with PDF
(Fig. 4):

$$p(x;\theta) = \frac{\theta}{2} + \frac{1-\theta}{A\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2}, \quad x \in (-1, 1). \tag{5}$$

Figure 3: Finding a confidence contour in two
dimensions. The large filled dot shows the location of
the maximum likelihood for the BaBar data. The open
dot shows the value of $(x_0'^2, y_0')$ chosen for a simulation.
The small dots show simulated experiments for which
$\Lambda_{\rm MC} > \Lambda_{\rm Data}$. The pluses, as well as the arrows pointing
offscale, show simulated experiments for which
$\Lambda_{\rm MC} < \Lambda_{\rm Data}$. The 95% contour resulting from the
algorithm described in the text is shown. The shaded
region is the "unphysical" region. Note that the
evaluation of the maximum likelihood is not restricted to
the "physical" region.

Figure 4: Graph of the example sampling PDF for two
values of parameter $\theta$: $\theta = 0.9$, and the "unphysical"
(negative signal) value $\theta = 1.1$. Note that both values are
mathematically permissible.

Figure 5: Example of a possible dataset generated
according to the flat background plus normal signal
PDF. The data are displayed in histogram form by the
points. The curve that goes negative (and is cut off at
the plot boundary) is the result of the (unbinned)
maximum likelihood fit. The other curve is the result of
the same fit, except with the constraint that it cannot
become negative.



The parameter of interest is the strength of the signal,
here expressed as $1 - \theta$, the probability of sampling a
signal event. An experiment samples $N$ events from
this distribution, with likelihood function:

$$\mathcal{L}(\theta; \{x_i, i = 1, \ldots, N\}) = \prod_{i=1}^{N} p(x_i; \theta). \tag{6}$$



It is quite possible that the likelihood will be maximal
for a value of $\theta$ for which the PDF is not defined.
The function $p(x; \theta)$ may become negative in some
region of $x$. If there are no events in this region, the
likelihood is still "well-behaved". However, the resulting
fit, as a description of the data, will typically
look poor even where the PDF is positive. This is
considered unacceptable.



An illustration of a possible sampled dataset from
this distribution is shown in Fig. 5, displayed as a
histogram. An (unbinned) maximum likelihood fit to
these data gives an estimate for $\theta$ in a region outside the
math boundary. The graph of the "PDF" curve for 
this estimate does not give a good representation of 
the data. On the other hand, if the fit is constrained 
to the math region, the graph of the PDF curve looks 
like a reasonable representation of the data. 

Thus, we suggest as a practical resolution to this 
problem to constrain the fit to remain within bounds 
such that the PDF is everywhere legitimate (n.b., pa- 
rameters may still be "unphysical"). Experience is 
that this gives fits which "look" like the data, as in the
present example (Fig. 5). This same practical recommendation
applies in interval evaluation (but coverage
should be checked, as always).
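
For the flat-plus-normal example, the "math boundary" can be written down explicitly, which makes the constrained fit easy to sketch. The code below is an illustration under stated assumptions: the signal width `SIGMA = 0.1` is not given in the text, $A$ is taken to be the fraction of the normal density lying inside (-1, 1), the signal is assumed narrow enough that its peak density exceeds 1/2 while its density at the edges is below 1/2, and the simple grid scan stands in for whatever minimizer is actually used.

```python
import math

SIGMA = 0.1  # assumed signal width; not specified in the text


def gauss_norm(sigma=SIGMA):
    """A: fraction of the N(0, sigma) density lying inside (-1, 1)."""
    return math.erf(1.0 / (sigma * math.sqrt(2.0)))


def pdf(x, theta, sigma=SIGMA):
    """p(x; theta) = theta/2 + (1 - theta) * truncated normal density."""
    g = math.exp(-0.5 * (x / sigma) ** 2) / (
        gauss_norm(sigma) * math.sqrt(2.0 * math.pi) * sigma
    )
    return 0.5 * theta + (1.0 - theta) * g


def math_bounds(sigma=SIGMA):
    """Range of theta for which p(x; theta) >= 0 everywhere on (-1, 1)."""
    g0 = pdf(0.0, 0.0, sigma)        # signal density at its peak
    g1 = pdf(1.0, 0.0, sigma)        # signal density at the edge
    theta_max = g0 / (g0 - 0.5)      # p(0; theta) = 0: most negative signal allowed
    theta_min = -g1 / (0.5 - g1)     # p(+-1; theta) = 0: most negative background allowed
    return theta_min, theta_max


def constrained_fit(xs, n_grid=2001, sigma=SIGMA):
    """Unbinned ML estimate of theta from a grid scan restricted to the
    mathematically allowed range."""
    lo, hi = math_bounds(sigma)
    best_theta, best_nll = None, math.inf
    for k in range(n_grid):
        theta = lo + (hi - lo) * k / (n_grid - 1)
        values = [pdf(x, theta, sigma) for x in xs]
        if min(values) <= 0.0:
            continue                  # guards against round-off exactly at a bound
        nll = -sum(math.log(v) for v in values)
        if nll < best_nll:
            best_theta, best_nll = theta, nll
    return best_theta
```

With these assumptions the upper bound lies a little above $\theta = 1.1$, so the "unphysical" value shown in Fig. 4 is indeed inside the allowed range, while a fit that wants to go further is stopped at the bound.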

Another issue that arises frequently in low statistics
(Poisson) sampling may be expressed in the form
of the following example: A "cut and count" analysis
for a branching fraction $B$ finds $n$ events. The
mean expected background contribution is estimated
as $\hat{b} \pm \sigma_b$ events. The efficiency and parent sample
are estimated to give a scale factor (relating observed
signal events to $B$) of $\hat{f} \pm \sigma_f$. The problem is to
determine a confidence interval (at 68% confidence, say), in
the frequency sense, for $B$.

We'll assume that $n$ is sampled from a Poisson distribution
with mean $\mu = \langle n \rangle = fB + b$, that $\hat{b}$ is sampled
from a normal distribution, $N(b, \sigma_b)$, and that $\hat{f}$
is sampled from a normal distribution, $N(f, \sigma_f)$. Thus
the likelihood function is:

$$\mathcal{L}(n, \hat{b}, \hat{f}; B, b, f) = \frac{\mu^n e^{-\mu}}{n!}\, \frac{1}{2\pi\sigma_b\sigma_f} \exp\left[-\frac{1}{2}\left(\frac{\hat{b}-b}{\sigma_b}\right)^2 - \frac{1}{2}\left(\frac{\hat{f}-f}{\sigma_f}\right)^2\right]. \tag{7}$$

It should be noted that this example is realistic, arising
in practice (to a good approximation). A variant
is to assume a normal distribution in $1/f$.

Several methods have been proposed, and used, for 
dealing with this problem (see Ref. Q for further dis- 
cussion of these): 

1. Just give $n$, $\hat{b} \pm \sigma_b$, $\hat{f} \pm \sigma_f$. This provides a
complete summary of the relevant information, and
should be done anyway. But it isn't a confidence
interval for $B$.

2. Integrate out the nuisance parameters according
to

$$\mathcal{L}(n, \hat{b}, \hat{f}; B) = \int df \int db\, \frac{\mu^n e^{-\mu}}{n!}\, \frac{1}{2\pi\sigma_b\sigma_f} \exp\left[-\frac{1}{2}\left(\frac{\hat{b}-b}{\sigma_b}\right)^2 - \frac{1}{2}\left(\frac{\hat{f}-f}{\sigma_f}\right)^2\right]. \tag{8}$$

This is easy, and often done. It may be interpreted
as a partially Bayesian approach, where
a uniform prior has been assumed for $f$ and $b$.
The frequency properties could be investigated,
but usually aren't.

3. A very common approach when quoting upper
limits is to do the appropriate Poisson statistical
analysis for $n$, but with the scale and background
parameters fixed at the estimated values
shifted by one standard deviation (in the direction
to make the limit higher than with the central
values). This has the benefit of being very
easy to do, but it is clearly ad hoc, and the coverage
is usually not investigated.

Here, I would like to comment on the possibility of 
evaluating these confidence intervals in another way. 

The method I consider is actually a very common 
method that seems to have been rather neglected as 
an approach to the present problem.

Figure 6: Coverage frequency as a function of $\Delta$ for
$f = 1$, $\sigma_f = 0.1$, $b = 0.5$, $\sigma_b = 0.1$. There are several
curves corresponding to different numbers of expected
signal events, $B$ (0, 1, 2, 3 and 4). The smoothest curve is
the coverage in the high statistics (normal) limit. The
horizontal axis is $\Delta(-\ln\mathcal{L})$.

The algorithm is



as follows: First, find the global maximum of the likelihood
function with respect to $B$, $f$, $b$. Then search in
the $B$ parameter for the point where $-\ln\mathcal{L}$ increases
from the minimum by a specified amount (perhaps by
$\Delta = 1/2$ for a 68% confidence interval), making sure
that the likelihood is re-maximized with respect to $f$
and $b$ during this search. The resulting points $B_\ell$, $B_u$
then give an estimated interval for parameter $B$ which
we would like to be a confidence interval.
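
A compact sketch of this search is given below. It is an illustration rather than the code used for the figures. For the Poisson-times-normal likelihood of Eq. (7), setting the derivatives with respect to $f$ and $b$ to zero at fixed $B$ leads to a quadratic equation for the Poisson mean, so the re-maximization can be done in closed form; that shortcut is a derivation made here, not something stated in the text, and it is only used for $n > 0$. The function names and the bracketing strategy are likewise assumptions.

```python
import math


def profiled_nll(B, n, b_hat, sig_b, f_hat, sig_f):
    """-ln L at fixed B, minimized over the nuisance parameters f and b.

    Constant terms (ln n!, normalizations of the normal factors) are dropped.
    """
    mu0 = f_hat * B + b_hat               # Poisson mean with f, b at their estimates
    s2 = (sig_f * B) ** 2 + sig_b ** 2    # spread of the mean allowed by the constraints
    # Stationarity in (f, b) gives mu^2 - (mu0 - s2) mu - s2 n = 0; take the positive root.
    c = mu0 - s2
    mu = 0.5 * (c + math.sqrt(c * c + 4.0 * s2 * n))
    if mu <= 0.0:
        return math.inf                   # Poisson mean undefined here
    t = 1.0 - n / mu
    pull_b = sig_b * t                    # (b_hat - b) / sig_b at the profiled point
    pull_f = sig_f * B * t                # (f_hat - f) / sig_f at the profiled point
    return mu - n * math.log(mu) + 0.5 * (pull_b ** 2 + pull_f ** 2)


def delta_interval(n, b_hat, sig_b, f_hat, sig_f, delta=0.5, tol=1e-9):
    """Points (B_lo, B_hi) where the profiled -ln L rises by delta. Assumes n > 0."""
    b_best = (n - b_hat) / f_hat          # global maximum: mu = n, f = f_hat, b = b_hat
    nll_min = n - n * math.log(n)

    def rise(B):
        return profiled_nll(B, n, b_hat, sig_b, f_hat, sig_f) - nll_min

    def crossing(direction):
        step, inner = 1.0, b_best
        outer = b_best + direction * step
        while rise(outer) < delta:        # expand until the crossing is bracketed
            if step > 1e6:
                raise ValueError("no crossing found")
            inner, step = outer, 2.0 * step
            outer = b_best + direction * step
        while abs(outer - inner) > tol:   # bisection on the bracketed crossing
            mid = 0.5 * (inner + outer)
            if rise(mid) < delta:
                inner = mid
            else:
                outer = mid
        return 0.5 * (inner + outer)

    return crossing(-1.0), crossing(+1.0)
```

For example, `delta_interval(5, 0.5, 0.1, 1.0, 0.1)` returns the estimated 68% interval for five observed events with the background and scale settings of Fig. 6. Whether such an interval actually covers at the stated level is a separate question that has to be checked by simulation.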

The question, of course, is: Does it work? To answer
this, we need to investigate the frequency property
of the algorithm. For large statistics (i.e., the
normal limit) we know it works: for $\Delta = 1/2$ this
method produces a 68% confidence interval for $B$. We
expect that it will fail in the extreme small statistics
limit, and the question becomes a quantitative one of
how far it can be pushed into the low statistics regime.
We answer this with Figs. 6-10.

Figure 6 shows the dependence of the coverage of
this algorithm on the value of $\Delta$, for several values
of $B$ and an expected background of 1/2 event. The
branching fraction scale is adjusted so that $B$ may
be interpreted as the mean number of signal events.
It may be seen that $\Delta = 1/2$ gives coverage reasonably
close to 68% for $B > 2$. Figure 7 shows the
coverage for $B = 0$, for several backgrounds. Even at
zero branching fraction, the $\Delta = 1/2$ coverage is fairly
close to 68% for expected backgrounds $b > 2$. Note
that extending this to intervals with higher confidence
may result in different conclusions.

It may be remarked that uncertainties in the background
and/or scale factor help to obtain the desired
coverage (Figs. 8 and 9). This is because they smooth
out the effect of the discreteness of the Poisson
sampling space.

Figure 7: Coverage frequency as a function of $\Delta$ for
$B = 0$, $f = 1$, $\sigma_f = 0$, $\sigma_b = 0.1$. There are several curves
corresponding to different numbers of expected
background events, $b$. The smoothest curve is the
coverage in the high statistics (normal) limit.

Figure 8: Coverage frequency as a function of mean
background $b$ for $B = 0$, $f = 1$, $\sigma_f = 0$, $\Delta = 1/2$. There
are several curves corresponding to different values of $\sigma_b$,
becoming smoother as $\sigma_b$ increases. The horizontal line is
at 68%.

Figure 9: Dependence of coverage on scale factor $f$ and
$\sigma_f$ for $B = 1$, $b = 2$, $\sigma_b = 0$, $\Delta = 1/2$. There are several
curves corresponding to different values of $\sigma_f$, becoming
smoother as $\sigma_f$ increases. The horizontal line is at 68%.

Figure 10: Coverage as a function of expected
background for $\Delta = 0.8$, $B = 0$, $f = 1$, $\sigma_f = 0$. There are
several curves corresponding to different values of $\sigma_b$,
becoming smoother as $\sigma_b$ increases. The horizontal line is
at 68%.

One issue is when the coverage is deemed to be
"good enough". It might be suggested that if the coverage
is known to be within some amount, say 5%
of 68%, that this is good enough for anything we are
going to use those numbers for. However, one could
also decide to take a "conservative" approach, and insist
that the coverage be at least at the quoted level.
One way to accomplish this is to shift the value of $\Delta$.
Fig. 10 shows the coverage as a function of expected
background (in the worst-case of zero signal branching
fraction and $\sigma_b = 0$) for a value of $\Delta = 0.8$. We
see that at least 68% coverage is guaranteed as long
as the mean background is greater than 1.4.
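
In the simplest limit the coverage can be computed exactly rather than by simulation, which gives a feeling for how the discreteness of the Poisson distribution enters. The sketch below is not the study behind the figures: it assumes $f = 1$, no uncertainty on $f$ or $b$, and true $B = 0$, and it treats $n = 0$ by taking the minimum of $-\ln\mathcal{L}$ at the math boundary $\mu \to 0$. The function names are illustrative.

```python
import math


def poisson_pmf(n, mu):
    """Poisson probability of n for mean mu, evaluated in log space."""
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))


def covers_zero_signal(n, b, delta):
    """True if the Delta(-ln L) interval obtained from n events contains B = 0.

    With f = 1 and b known, -ln L(mu) = mu - n ln(mu) + const, minimal at mu = n.
    """
    if n == 0:
        rise = b                          # minimum is at the boundary mu -> 0
    else:
        rise = (b - n) + n * math.log(n / b)
    return rise <= delta


def coverage(b, delta=0.5):
    """Exact coverage for true B = 0: sum of Poisson probabilities over all n
    whose interval contains the true value."""
    n_max = int(b + 20.0 * math.sqrt(b) + 50)   # far enough into the tail
    return sum(
        poisson_pmf(n, b)
        for n in range(n_max + 1)
        if covers_zero_signal(n, b, delta)
    )
```

Scanning `coverage(b, 0.5)` and `coverage(b, 0.8)` over a range of `b` produces the kind of sawtooth dependence on the mean background that the uncertainties $\sigma_b$ and $\sigma_f$ smooth out.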

We'll conclude this discussion with a few summary
remarks: First, it is a good idea to always quote
$n$, $\hat{b} \pm \sigma_b$, and $\hat{f} \pm \sigma_f$. Second, any approach used
should be justified with a computation of the coverage.
The likelihood analysis studied here works pretty
well even down to rather low statistics for 68% confidence
intervals. It should be kept in mind however
that "good enough" for 68% intervals does not imply
good enough for other purposes, such as tests of significance.
Finally, if $\sigma_b \approx b$ or $\sigma_f \approx f$ this is outside the
regime studied here; the normal assumption is likely
invalid in this case.

4.2.3. Interpretation Intervals 

In the interpretation stage, Bayesian intervals may 
be given, as deemed useful to the consumer. In BaBar 
practice, this is typically done when someone wants to 
give an upper limit, and is usually implemented with 
the assumption of a uniform prior in the parameter of
interest. BaBar recognizes the issues surrounding the
choice of prior. The recommendation is to consider
it carefully, and to make checks on how sensitive the
result is to the choice. Even this recommendation is 
not routinely adopted however. 

4.3. Significance 

The "significance" of an observation (e.g., of the 
presence of a signal for some process) is defined as the 
probability of the observed deviation (or larger) from 
the null (no signal) model, under the null hypothesis. 
The recommended procedure in BaBar is to compute 
this probability according to the frequentist method- 
ology. It may be noted that knowing the 68% confi- 
dence interval does not always provide much insight 
into the significance. The tails of the null sampling 
distribution may be non-normal. A separate analysis 
is generally required, in which the tails are appropri- 
ately modelled. 

No recommendation is tendered for when to label 
a result as "significant". We struggled with possible 
algorithms, but eventually gave up, because such a 
label implies an interpretation. No uniform prescrip- 
tion seems to make sense; judgement is involved. For 
example, deciding that the observation of a bizarre 
new particle is significant may involve a different stan- 
dard than the claim that an expected decay mode 
of an established particle is significant. It isn't re- 
ally our primary role as experimenters; it is up to 
the reader ultimately to decide what they wish to be- 
lieve. This is perhaps the least-accepted of the Statis- 
tics Working Group's points in BaBar: people insist 
on making qualitative statements, e.g., "observation
of", "evidence for", "discovery of", "not significant",
"consistent with". A code exists in which "observation
of" becomes quantified as $> 4\sigma$ significance, and
"evidence for" means $> 3\sigma$.

This preoccupation with qualitative interpretive 
terminology is pervasive beyond BaBar. For example, 
the following excerpt appeared in Physics Today
(italics mine, references deleted):

"In March, back-to-back papers in Physical 
Review Letters reported the measurement of 
CP symmetry violation in the decay of neutral
B mesons by groups in Japan and California.
Now the word *measurement* has
been replaced by *observation* in the titles of
two new back-to-back reports by these same 
groups in the 27 August Physical Review Let- 
ters. That is to say, with a lot more data and 
improved event reconstruction, the BaBar col- 
laboration at SLAC and the Belle collabora- 
tion at KEK in Japan have at last produced 
the first compelling evidence of CP violation in 
any system other than the neutral K mesons." 

For another example, some people think a measure- 



ment should not be called a "measurement" unless 
the result is significantly different from zero. An editor
at a prominent journal has suggested that *bounds
on* might be more appropriate than *measurement*
in reference to a CP asymmetry angle which was observed
as consistent with zero. This can lead to amusing
ironies: Finding $\sin 2\beta = 0.00 \pm 0.01$ would be an
exciting contradiction with the standard model. But
it isn't a "measurement"?

A further issue that arises is that many people mix 
the question of significance with the choice of interval 
(i.e., one-sided vs two-sided). This has a drawback,
because choosing how to quote the interval based on
the result of the measurement can introduce a bias.
The algorithm of Feldman and Cousins is designed
to address this. However, this methodology
is not adopted in BaBar because of the constraint on 
the physical region, as discussed earlier. Instead, our 
recommendation is to always give a two-sided interval 
(if otherwise appropriate), independent of the signifi- 
cance. The significance is quoted separately. Quoting 
a one-sided interval may optionally also be done, and 
is usually regarded as part of the interpretation (hence 
a Bayesian approach is suggested). This recommen- 
dation is typically followed in BaBar, but there have 
been exceptions. 

Another issue that arises in the quoting of significance
has to do with the tradition of quoting significance
as $n\sigma$. Unfortunately, this is used to mean different
things: Sometimes it actually means $n$ standard
deviations. But sometimes it means the probability
content of an $n\sigma$ fluctuation for a normal distribution.
We recommend quoting the probability directly
if the sampling distribution is not normal. However,
this has met with very limited implementation.
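
The translation between the two usages, for a normal distribution, is a one-line calculation each way. The helper names below are illustrative, and the choice between the one-sided and two-sided conventions is itself something that has to be stated when a result is quoted.

```python
import math
from statistics import NormalDist


def one_sided_p(n_sigma):
    """Probability of a normal fluctuation at least n_sigma above the mean."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


def two_sided_p(n_sigma):
    """Probability of a normal fluctuation at least n_sigma away from the mean,
    in either direction."""
    return math.erfc(n_sigma / math.sqrt(2.0))


def equivalent_sigma(p_one_sided):
    """Number of standard deviations whose one-sided normal tail probability
    equals p_one_sided."""
    return -NormalDist().inv_cdf(p_one_sided)
```

When the sampling distribution is not normal, only the probability is well defined; `equivalent_sigma` then merely re-expresses that probability on a familiar scale and says nothing about how many standard deviations the observation lies from the null expectation.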

4.4. Systematic Uncertainties 

BaBar makes many checks in a typical analysis. For 
the purpose of defining systematic uncertainties, we 
divide these into two broad categories: 

1. "Blind checks": This is a test for mistakes. No 
correction to the data is anticipated. If the test 
passes, then there is no contribution to the sys- 
tematic error. An example of such a check is 
dividing the data into two chronological subsets 
and comparing the results. 

2. "Educated checks": This is a measurement of 
biases or corrections, and may affect the quoted 
result. It involves a contribution to the system- 
atic error. An example is the model dependence 
of the efficiency calculation. 

It is recommended that the systematic uncertainty 
be quoted separately from the statistical uncertainty. 
The sources of systematic uncertainty should be described,
and may contain statistical components, for
example due to limited Monte Carlo statistics in the
efficiency evaluation.

Figure 11: Incorporating systematic uncertainties in the
confidence contour for D mixing. The filled dot is the
location of the best fit point, the open circle the
best fit point in the physical region. The solid contour
(dotted if restricting to CP conserving models) shows the
95% confidence contour according to statistical errors
only. The dot-dash contour (dash for CP conserving
models) shows how this contour becomes scaled on
incorporating systematic uncertainties.

We return to our earlier example (Sec. 4.2.1) of
D mixing for an example of the treatment of systematic
uncertainties. The goal here is to produce a
two-dimensional confidence contour in the parameter
space which incorporates the systematic uncertainties.
In this case, the statistical uncertainties are large, and
we are willing to accept an approximation in order
to keep the procedure simple. Thus, it is decided to
use a method which takes the statistics-only contour
and scales it uniformly along rays from the best fit
value. The scaling factor is $\sqrt{1 + \sum_i m_i^2}$, where $m_i$ is
an estimate of systematic uncertainty $i$ in units of the
statistical uncertainty. This estimate is obtained by
determining the effect of the systematic uncertainty
on $(x'^2, y')$ (the position of the best fit). Figure 11
shows the result of this procedure. This method is
conservative (or lazy) in the sense that scaling for a
given systematic in one (worst case) direction is applied
uniformly in all directions. On the other hand,
by evaluating the error at the best fit position, a linear
approximation is being made.
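
The scaling step itself is simple enough to write out. In this sketch the contour is assumed to be available as a list of points in the $(x'^2, y')$ plane, and `m` is the list of systematic shifts already expressed in units of the statistical uncertainty; the function name and the data layout are illustrative.

```python
import math


def scale_contour(contour, best_fit, m):
    """Scale a statistics-only contour along rays from the best-fit point.

    contour  : list of (x, y) points on the statistics-only contour
    best_fit : (x, y) position of the best fit
    m        : systematic uncertainties in units of the statistical uncertainty
    """
    factor = math.sqrt(1.0 + sum(mi * mi for mi in m))
    x0, y0 = best_fit
    return [(x0 + factor * (x - x0), y0 + factor * (y - y0)) for x, y in contour]
```

Because a single factor is applied in every direction, a systematic effect that mainly moves the fit along one axis still enlarges the contour along the other, which is the sense in which the method is conservative.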

4.5. Goodness of Fit 

There appears to be no perfect general goodness-of-fit
test. Given a dataset generated under the null
hypothesis, one can usually find a test which rejects
the null hypothesis (and this may be taken as a warning
that choosing the test after you see the data is
dangerous). Given a dataset generated under an alternative
hypothesis, one can usually find a test for
which the null passes. It seems advisable to think
about what one wants to test for in choosing the test.

Figure 12: Measurement of CP violation (BaBar).
The upper plot shows the measurement (points) of the
time distributions, as a function of $\Delta t$ (ps), for $B^0$ and
$\bar{B}^0$ decays to selected CP eigenstates. The curves show
the result of a maximum likelihood fit to the data. The
lower plot shows the time-dependent asymmetry between
the $B^0$ and $\bar{B}^0$ decays, again with the fitted curve
overlaid. The asymmetry would be zero in the absence of
CP violation.

For example, Fig. 12 shows data used in a measurement
of CP violation by BaBar. A likelihood ratio
(or a chi-square) test of the time distribution may be
a good test for the lifetime fit to the data, but it may
have little sensitivity to testing the goodness-of-fit of
the CP asymmetry, which is a low-frequency question.

So far, BaBar generally uses likelihood ratio tests 
or chi-square tests if appropriate. The Kolmogorov- 
Smirnov test is also used. If a test statistic such as the 
likelihood ratio is used, then a Monte Carlo evaluation 
of the distribution of the statistic is recommended, 
rather than assuming an asymptotic property. 
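
A Monte Carlo evaluation of this kind can be sketched for a binned counting experiment. The example is a simplified illustration under stated assumptions: the expected bin contents are treated as fixed (no parameters are refitted in each toy), the statistic is the Poisson likelihood ratio with respect to the saturated model, and the small Poisson generator is only suitable for modest means. None of this is specific to BaBar.

```python
import math
import random


def lr_statistic(observed, expected):
    """2 * ln(L_saturated / L_model) for independent Poisson bins."""
    total = 0.0
    for n, mu in zip(observed, expected):
        total += mu - n
        if n > 0:
            total += n * math.log(n / mu)
    return 2.0 * total


def poisson_draw(mu, rng):
    """Poisson random number by the multiplication method (small mu only)."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def mc_p_value(observed, expected, n_toys=5000, seed=1):
    """Fraction of toy datasets whose statistic is at least the observed one."""
    rng = random.Random(seed)
    t_obs = lr_statistic(observed, expected)
    n_tail = 0
    for _ in range(n_toys):
        toy = [poisson_draw(mu, rng) for mu in expected]
        if lr_statistic(toy, expected) >= t_obs:
            n_tail += 1
    return n_tail / n_toys
```

With only a few expected events per bin the distribution of the statistic can differ noticeably from a chi-square distribution, which is the reason for generating it rather than assuming it. If parameters are fitted to the data, the same fit should be repeated on each toy.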

4.6. Consistency of Analyses 

BaBar has several times encountered the question
of whether a new analysis is consistent with an old
analysis. Often, the new analysis is a combination of
additional data plus changed (improved) analysis of
the original data. The stickiest issue is handling the
correlation when testing for consistency in the
overlapping data. People sometimes have difficulty
understanding that statistical differences can arise
even when comparing results based on the same events,
so we expound on this.

Given a sampling $\hat\theta_1, \hat\theta_2$ from a bivariate normal
distribution $N(\theta, \sigma_1, \sigma_2, \rho)$, with
$\langle\hat\theta_1\rangle = \langle\hat\theta_2\rangle = \theta$, the
difference $\Delta\theta = \hat\theta_2 - \hat\theta_1$ is
$N(0, \sigma)$-distributed with
$\sigma^2 = \sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2$.
If the correlation is unknown, all we can say is that the
variance of the difference is in the range
$(\sigma_1 - \sigma_2)^2 \ldots (\sigma_1 + \sigma_2)^2$. If we at least
believe $\rho \ge 0$ then the maximum variance of the
difference is $\sigma_1^2 + \sigma_2^2$.
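
The bounds just described can be sketched numerically as follows. This is only an illustration of the variance formula for the difference of two correlated Gaussian estimators; the function names and the input values are hypothetical.

```python
import math

def sigma_of_difference(sigma1, sigma2, rho):
    """Standard deviation of (theta2_hat - theta1_hat) for correlation rho."""
    variance = sigma1**2 + sigma2**2 - 2.0 * rho * sigma1 * sigma2
    return math.sqrt(max(variance, 0.0))   # guard against rounding below zero

def sigma_bounds(sigma1, sigma2, assume_nonnegative_rho=False):
    """Smallest and largest possible sigma when rho is unknown."""
    lowest = sigma_of_difference(sigma1, sigma2, 1.0)     # |sigma1 - sigma2|
    rho_min = 0.0 if assume_nonnegative_rho else -1.0
    highest = sigma_of_difference(sigma1, sigma2, rho_min)
    return lowest, highest

# Hypothetical uncertainties on the two results
low, high = sigma_bounds(0.3, 0.4)
low_pos, high_pos = sigma_bounds(0.3, 0.4, assume_nonnegative_rho=True)
```

For these illustrative inputs the allowed range is 0.1 to 0.7, shrinking to 0.1 to 0.5 if the correlation is taken to be non-negative.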

Suppose we measure a neutrino mass, $m$, in a sample
of $n = 10$ independent events. The measurements
are $x_i$, $i = 1, \ldots, 10$. Assume the sampling
distribution for $x_i$ is $N(m, \sigma_i)$.

We may form an unbiased estimator, $\hat m_1$, for $m$:

$$\hat m_1 = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (9)$$

The result (from a Monte Carlo simulation) is
$\hat m_1 = 0.058 \pm 0.039$.

Then we notice that we have some further information
which might be useful: we know the experimental
resolutions, $\sigma_i$, for each measurement. We form
another unbiased estimator, $\hat m_2$, for $m$:

$$\hat m_2 = \frac{\sum_{i=1}^{n} x_i/\sigma_i^2}{\sum_{i=1}^{n} 1/\sigma_i^2} \pm \frac{1}{\sqrt{\sum_{i=1}^{n} 1/\sigma_i^2}} \qquad (10)$$

The result (from the same simulation, i.e., from the
same events) is $\hat m_2 = 0.000 \pm 0.016$.

The results are certainly correlated, so the question
of consistency arises (we know the error on the
difference is between 0.023 and 0.055). In this example, the
difference between the results is 0.058 ± 0.036, where
the 0.036 error includes the correlation ($\rho = 0.41$).
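
A toy simulation in the spirit of this example can be sketched as below. It is not the simulation used for the numbers above: the per-event resolutions, the seed, and the function names are hypothetical. Many pseudo-experiments are generated, both estimators are evaluated on the same events each time, and the correlation between them is measured directly.

```python
import math
import random

def simulate_estimators(resolutions, n_experiments=20000, true_mass=0.0, seed=7):
    """Return (unweighted mean, weighted mean) pairs from repeated toy samples."""
    rng = random.Random(seed)
    weights = [1.0 / s**2 for s in resolutions]
    weight_sum = sum(weights)
    pairs = []
    for _ in range(n_experiments):
        xs = [rng.gauss(true_mass, s) for s in resolutions]
        m1 = sum(xs) / len(xs)                                   # ignores resolutions
        m2 = sum(w * x for w, x in zip(weights, xs)) / weight_sum
        pairs.append((m1, m2))
    return pairs

def correlation(pairs):
    """Sample correlation coefficient of the two estimators."""
    n = len(pairs)
    mean1 = sum(a for a, _ in pairs) / n
    mean2 = sum(b for _, b in pairs) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in pairs) / n
    var1 = sum((a - mean1) ** 2 for a, _ in pairs) / n
    var2 = sum((b - mean2) ** 2 for _, b in pairs) / n
    return cov / math.sqrt(var1 * var2)

# Hypothetical resolutions for the n = 10 events
resolutions = [0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.25]
pairs = simulate_estimators(resolutions)
rho_mc = correlation(pairs)

# Analytic expectation: cov(m1, m2) equals var(m2), so rho = sigma2 / sigma1
sigma1 = math.sqrt(sum(s**2 for s in resolutions)) / len(resolutions)
sigma2 = 1.0 / math.sqrt(sum(1.0 / s**2 for s in resolutions))
rho_expected = sigma2 / sigma1
```

For these two particular estimators the covariance equals the variance of the weighted mean, so the correlation is simply the ratio of the two errors. With the errors quoted in the text this ratio is 0.016/0.039, or about 0.41, in agreement with the correlation given above.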

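Art Snyder has developed an approximate formula
for evaluating the correlation in a comparison of
maximum likelihood analyses. Suppose we perform two
maximum likelihood analyses, with event likelihoods
$\mathcal{L}_1$, $\mathcal{L}_2$, on the same set of $N$ events [n.b., we may use
different information in each analysis]. The results are
estimators $\hat\theta_1$, $\hat\theta_2$ for parameter $\theta$ (restricting to the
one-dimensional case for simplicity). The correlation
coefficient $\rho$ may be estimated according to:

$$\rho \approx \frac{\sum_{i=1}^{N}\left[\left.\frac{d\ln\mathcal{L}_{1i}}{d\theta}\right|_{\theta=\hat\theta_1}\left.\frac{d\ln\mathcal{L}_{2i}}{d\theta}\right|_{\theta=\hat\theta_2} + R_i\right]}{\sqrt{\left(\sum_{i=1}^{N}\left.\frac{d^2\ln\mathcal{L}_{1i}}{d\theta^2}\right|_{\theta=\theta_0}\right)\left(\sum_{i=1}^{N}\left.\frac{d^2\ln\mathcal{L}_{2i}}{d\theta^2}\right|_{\theta=\theta_0}\right)}} \qquad (11)$$

where ($\theta_0$ is an expansion reference point):

$$R_i = (\hat\theta_1 - \theta_0)(\hat\theta_2 - \theta_0)\,\frac{d^2\ln\mathcal{L}_{1i}}{d\theta^2}\,\frac{d^2\ln\mathcal{L}_{2i}}{d\theta^2}.$$

If $\theta_0 \approx \hat\theta_1 \approx \hat\theta_2$, then

$$\rho \approx \sigma_1\sigma_2\sum_{i=1}^{N}\frac{d\ln\mathcal{L}_{1i}}{d\theta}\,\frac{d\ln\mathcal{L}_{2i}}{d\theta} \qquad (12)$$

where $\sigma_k^2 \equiv 1\Big/\sum_{i=1}^{N}\left(-\left.\frac{d^2\ln\mathcal{L}_{ki}}{d\theta^2}\right|_{\theta=\theta_0}\right)$, $k = 1, 2$.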
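
To illustrate how such an estimate behaves, the sketch below applies the simplified form, which we take here to be $\rho \approx \sigma_1\sigma_2\sum_i (d\ln\mathcal{L}_{1i}/d\theta)(d\ln\mathcal{L}_{2i}/d\theta)$ with $\sigma_k^2 = -1/\sum_i d^2\ln\mathcal{L}_{ki}/d\theta^2$, to a toy problem. Everything specific is an assumption made for the illustration: analysis 1 uses a Gaussian event likelihood with a common width (set to the RMS of the true resolutions), analysis 2 uses the per-event resolutions, and the sample size, seed, and names are hypothetical.

```python
import math
import random

def snyder_correlation(scores1, scores2, curvatures1, curvatures2):
    """Approximate rho from per-event first and second log-likelihood derivatives."""
    sigma1 = 1.0 / math.sqrt(-sum(curvatures1))
    sigma2 = 1.0 / math.sqrt(-sum(curvatures2))
    return sigma1 * sigma2 * sum(a * b for a, b in zip(scores1, scores2))

rng = random.Random(3)
n_events = 20000
resolutions = [rng.uniform(0.5, 2.0) for _ in range(n_events)]   # hypothetical
xs = [rng.gauss(0.0, s) for s in resolutions]                    # true theta = 0

# Analysis 1: common-width Gaussian likelihood; analysis 2: per-event widths
common_var = sum(s**2 for s in resolutions) / n_events
weight_sum = sum(1.0 / s**2 for s in resolutions)
theta1 = sum(xs) / n_events
theta2 = sum(x / s**2 for x, s in zip(xs, resolutions)) / weight_sum
theta0 = 0.5 * (theta1 + theta2)                                 # expansion point

# Per-event derivatives of ln L with respect to theta, evaluated at theta0
scores1 = [(x - theta0) / common_var for x in xs]
scores2 = [(x - theta0) / s**2 for x, s in zip(xs, resolutions)]
curvatures1 = [-1.0 / common_var] * n_events
curvatures2 = [-1.0 / s**2 for s in resolutions]

rho_estimate = snyder_correlation(scores1, scores2, curvatures1, curvatures2)

# Exact correlation between the unweighted and weighted means for these widths
rho_exact = n_events / math.sqrt(weight_sum * sum(s**2 for s in resolutions))
```

In this Gaussian case the estimate can be compared with the exact correlation between the two means. The agreement relies on the common width in analysis 1 having been chosen so that its curvature-based error matches its true error; a badly misspecified likelihood would not be expected to do as well.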

Let us look at a real example of the consistency
question in a BaBar analysis, the measurement of the
CP-violation parameter $\sin 2\beta$. In August 2001, we
published a result based on a dataset of $32 \times 10^6$
$B\bar{B}$ pairs [12]:

$$\sin 2\beta = 0.59 \pm 0.14\,(\mathrm{stat}) \pm 0.05\,(\mathrm{syst}) \qquad (13)$$

An updated result was produced in March 2002, based
on $62 \times 10^6$ $B\bar{B}$ pairs:

$$\sin 2\beta = 0.75 \pm 0.09\,(\mathrm{stat}) \pm 0.04\,(\mathrm{syst}) \qquad (14)$$



The second result includes the earlier data,
re-reconstructed. The analysis is not simply counting
events; it involves multivariate maximum likelihood 
fits, reprocessing changes, and relative likelihoods for 
an event to be signal or background, for example. The 
question is, are the two results statistically consistent? 

If these were independent data sets, a difference of
0.16 ± 0.17 would not be a worry. The issue is the
correlation. A specialized analysis deriving from Eqn. 11
is performed on the events in common between the
two analyses. A correlation of $\rho = 0.87$ is deduced,
yielding a difference of ~2.2$\sigma$. This corresponds to
a probability of 3%, which is small enough that we
noticed, and looked hard for possible systematic
problems, but not so small as to be alarming, especially in an
experiment with many such tests being made.
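
The arithmetic behind this kind of statement can be sketched as follows. This is a simplified check, not the specialized analysis referred to above: it applies the quoted correlation directly to the two published statistical errors and assumes a Gaussian difference. The function name is hypothetical.

```python
import math

def pull_and_p_value(value1, sigma1, value2, sigma2, rho):
    """Error on the difference, its significance, and the two-sided p-value."""
    variance = sigma1**2 + sigma2**2 - 2.0 * rho * sigma1 * sigma2
    sigma_diff = math.sqrt(variance)
    pull = abs(value2 - value1) / sigma_diff
    p_value = math.erfc(pull / math.sqrt(2.0))   # two-sided Gaussian tail
    return sigma_diff, pull, p_value

# Published statistical errors, first treated as uncorrelated, then with rho = 0.87
sigma_indep, pull_indep, p_indep = pull_and_p_value(0.59, 0.14, 0.75, 0.09, rho=0.0)
sigma_corr, pull_corr, p_corr = pull_and_p_value(0.59, 0.14, 0.75, 0.09, rho=0.87)
```

With the rounded published inputs the correlated case gives a pull slightly above 2 and a p-value of a few percent, close to the values quoted in the text; exact agreement should not be expected from this simplified treatment.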

There has been some impression that BaBar may
be seeing more differences between old and updated
results than people are used to, and the question arises
whether BaBar is making mistakes. The first part of
the answer is that, based on studies such as the above,
there is no compelling statistical evidence to support
the contention that mistakes are being made. There
should be differences among results, purely due to
statistical fluctuations, and BaBar sees nothing clearly
beyond what might be expected from statistics. The
second part of the answer is a speculation as to why
the impression may exist. BaBar is different from most
other experiments in that it makes extensive use of the
blind methodology, so there is little opportunity to
react to observed differences with further changes in
analysis. Without the blind methodology, there is the
potential for bias, tending towards making results
agree with earlier results better than they should.



5. Reflections 

It is my observation that statistical sophistication
in particle physics (not specific to BaBar) has grown
significantly, not so much in the choice of methods,
which are often long-established, but in the
understanding attached to them. People now understand
that there is a choice of approach between Bayesian
and frequency statistics, though there is as yet no
uniform agreement on adoption. There is also
considerable awareness of the issue of biases in analyses;
for example, BaBar now relies heavily on blind
methodology.

BaBar adopts frequency statistics for describing
results, and much attention is devoted to Monte Carlo
validation and verification of coverage. The use of
the Bayesian approach in high energy physics,
including BaBar, is still not mature: there is no established
methodology for choosing the prior distribution, other
than to default to a uniform prior. The justification
for this is basically that it usually doesn't matter
very much. There are, however, still open issues even in
frequency statistics. Controversies involve such
notions as restricting to the "physical region", or that
the presence of backgrounds should "always" lead to
higher upper limits. Neither of these notions is a
concern in the BaBar recommendations.

BaBar is attempting to provide a coherent, docu- 
mented approach to its use of statistics in its results. 
This is very much a work in progress. 
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